Overview

Principal Component Analysis (PCA) is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations that retain as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional projection.

Clustering analysis is another form of EDA. Here we hope to group data points so that points within a group are close to each other while points in different groups are far apart. Clustering on the PCs can be effective. Clustering analysis can be quite subjective, since we must decide how to summarize the properties within each group.

Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.

For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.

0.1 Objectives

  • PCA
  • SVD
  • Clustering Analysis
  • Linear Regression

0.2 Review materials

  • Study Module 2: PCA
  • Study Module 3: Clustering Analysis
  • Study Module 4: Multiple regression

0.3 Data needed

  • NLSY79.csv
  • brca_subtype.csv
  • brca_x_patient.csv

1 Case study 1: Self-esteem

Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.

The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year variables have been gathered through surveys. The survey data is open to the public. Among the many variables available, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.

The data is stored in NLSY79.csv.

Here are the descriptions of the variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 1987 and 2005, respectively.
  • Job87, Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78

Variables Related to ASVAB test Scores in 1981

  • AFQT: percentile score on the AFQT intelligence test in 1981
  • Coding: score on the Coding Speed test in 1981
  • Auto: score on the Automotive and Shop test in 1981
  • Mechanic: score on the Mechanic test in 1981
  • Elec: score on the Electronics Information test in 1981
  • Science: score on the General Science test in 1981
  • Math: score on the Math test in 1981
  • Arith: score on the Arithmetic Reasoning test in 1981
  • Word: score on the Word Knowledge Test in 1981
  • Parag: score on the Paragraph Comprehension test in 1981
  • Numer: score on the Numerical Operations test in 1981

Self-Esteem tests in 1981 and 1987

We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.

The following 10 questions are answered on a four-point scale — 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree.

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

## 'data.frame':    2431 obs. of  46 variables:
##  $ Subject       : int  2 6 7 8 9 13 16 17 18 20 ...
##  $ Gender        : chr  "female" "male" "male" "female" ...
##  $ Education05   : int  12 16 12 14 14 16 13 13 13 17 ...
##  $ Income87      : int  16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
##  $ Job05         : chr  "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
##  $ Income05      : int  5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
##  $ Weight05      : int  160 187 175 246 180 235 160 188 173 130 ...
##  $ HeightFeet05  : int  5 5 5 5 5 6 5 5 5 5 ...
##  $ HeightInch05  : int  2 5 9 3 6 0 4 10 9 4 ...
##  $ Imagazine     : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Inewspaper    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Ilibrary      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MotherEd      : int  5 12 12 9 12 12 12 12 12 12 ...
##  $ FatherEd      : int  8 12 12 6 10 16 12 15 16 18 ...
##  $ FamilyIncome78: int  20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
##  $ Science       : int  6 23 14 18 17 16 13 19 22 21 ...
##  $ Arith         : int  8 30 14 13 21 30 17 29 30 17 ...
##  $ Word          : int  15 35 27 35 28 29 30 33 35 28 ...
##  $ Parag         : int  6 15 8 12 10 13 12 13 14 14 ...
##  $ Number        : int  29 45 32 24 40 36 49 35 48 39 ...
##  $ Coding        : int  52 68 35 48 46 30 58 58 61 54 ...
##  $ Auto          : int  9 21 13 11 13 21 11 18 21 18 ...
##  $ Math          : int  6 23 11 4 13 24 17 21 23 20 ...
##  $ Mechanic      : int  10 21 9 12 13 19 11 19 16 20 ...
##  $ Elec          : int  5 19 11 12 15 16 10 16 17 13 ...
##  $ AFQT          : num  6.84 99.39 47.41 44.02 59.68 ...
##  $ Esteem81_1    : int  1 2 2 1 1 1 2 2 2 1 ...
##  $ Esteem81_2    : int  1 1 1 1 1 1 2 2 2 1 ...
##  $ Esteem81_3    : int  4 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_4    : int  1 2 2 2 1 1 2 2 2 1 ...
##  $ Esteem81_5    : int  3 4 3 3 1 4 3 3 3 3 ...
##  $ Esteem81_6    : int  3 2 2 2 1 1 2 2 2 2 ...
##  $ Esteem81_7    : int  1 2 2 3 1 1 3 2 2 1 ...
##  $ Esteem81_8    : int  3 4 2 3 4 4 3 3 3 3 ...
##  $ Esteem81_9    : int  3 3 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_10   : int  3 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem87_1    : int  2 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_2    : int  1 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_3    : int  4 4 4 3 4 4 4 3 4 4 ...
##  $ Esteem87_4    : int  1 1 2 1 1 1 2 2 1 4 ...
##  $ Esteem87_5    : int  2 4 4 4 4 4 4 3 4 4 ...
##  $ Esteem87_6    : int  2 1 2 2 1 1 2 2 1 1 ...
##  $ Esteem87_7    : int  2 2 2 1 1 2 2 2 2 1 ...
##  $ Esteem87_8    : int  3 3 4 2 4 4 4 3 4 3 ...
##  $ Esteem87_9    : int  3 2 3 2 4 4 3 3 3 4 ...
##  $ Esteem87_10   : int  4 4 4 2 4 4 4 3 4 4 ...
##     Subject         Gender           Education05      Income87    
##  Min.   :    2   Length:2431        Min.   : 6.0   Min.   :   -2  
##  1st Qu.: 1592   Class :character   1st Qu.:12.0   1st Qu.: 4500  
##  Median : 3137   Mode  :character   Median :13.0   Median :12000  
##  Mean   : 3504                      Mean   :13.9   Mean   :13399  
##  3rd Qu.: 4668                      3rd Qu.:16.0   3rd Qu.:19000  
##  Max.   :12140                      Max.   :20.0   Max.   :59387  
##     Job05              Income05         Weight05    HeightFeet05  
##  Length:2431        Min.   :    63   Min.   : 81   Min.   :-4.00  
##  Class :character   1st Qu.: 22650   1st Qu.:150   1st Qu.: 5.00  
##  Mode  :character   Median : 38500   Median :180   Median : 5.00  
##                     Mean   : 49415   Mean   :183   Mean   : 5.18  
##                     3rd Qu.: 61350   3rd Qu.:209   3rd Qu.: 5.00  
##                     Max.   :703637   Max.   :380   Max.   : 8.00  
##   HeightInch05     Imagazine       Inewspaper       Ilibrary       MotherEd   
##  Min.   : 0.00   Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   : 0.0  
##  1st Qu.: 2.00   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:11.0  
##  Median : 5.00   Median :1.000   Median :1.000   Median :1.00   Median :12.0  
##  Mean   : 5.32   Mean   :0.718   Mean   :0.861   Mean   :0.77   Mean   :11.7  
##  3rd Qu.: 8.00   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.00   3rd Qu.:12.0  
##  Max.   :11.00   Max.   :1.000   Max.   :1.000   Max.   :1.00   Max.   :20.0  
##     FatherEd    FamilyIncome78     Science         Arith           Word     
##  Min.   : 0.0   Min.   :    0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:10.0   1st Qu.:11167   1st Qu.:13.0   1st Qu.:13.0   1st Qu.:23.0  
##  Median :12.0   Median :20000   Median :17.0   Median :19.0   Median :28.0  
##  Mean   :11.8   Mean   :21252   Mean   :16.3   Mean   :18.6   Mean   :26.6  
##  3rd Qu.:14.0   3rd Qu.:27500   3rd Qu.:20.0   3rd Qu.:25.0   3rd Qu.:32.0  
##  Max.   :20.0   Max.   :75001   Max.   :25.0   Max.   :30.0   Max.   :35.0  
##      Parag          Number         Coding          Auto           Math     
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:10.0   1st Qu.:29.0   1st Qu.:38.0   1st Qu.:10.0   1st Qu.: 9.0  
##  Median :12.0   Median :36.0   Median :48.0   Median :14.0   Median :14.0  
##  Mean   :11.2   Mean   :35.5   Mean   :47.1   Mean   :14.3   Mean   :14.3  
##  3rd Qu.:14.0   3rd Qu.:44.0   3rd Qu.:57.0   3rd Qu.:18.0   3rd Qu.:20.0  
##  Max.   :15.0   Max.   :50.0   Max.   :84.0   Max.   :25.0   Max.   :25.0  
##     Mechanic         Elec           AFQT         Esteem81_1     Esteem81_2  
##  Min.   : 0.0   Min.   : 0.0   Min.   :  0.0   Min.   :1.00   Min.   :1.00  
##  1st Qu.:11.0   1st Qu.: 9.0   1st Qu.: 31.9   1st Qu.:1.00   1st Qu.:1.00  
##  Median :14.0   Median :12.0   Median : 57.0   Median :1.00   Median :1.00  
##  Mean   :14.4   Mean   :11.6   Mean   : 54.7   Mean   :1.42   Mean   :1.42  
##  3rd Qu.:18.0   3rd Qu.:15.0   3rd Qu.: 78.2   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :25.0   Max.   :20.0   Max.   :100.0   Max.   :4.00   Max.   :4.00  
##    Esteem81_3     Esteem81_4     Esteem81_5     Esteem81_6     Esteem81_7  
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.00   1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00  
##  Median :4.00   Median :2.00   Median :4.00   Median :2.00   Median :2.00  
##  Mean   :3.51   Mean   :1.57   Mean   :3.46   Mean   :1.62   Mean   :1.75  
##  3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##    Esteem81_8     Esteem81_9    Esteem81_10    Esteem87_1     Esteem87_2 
##  Min.   :1.00   Min.   :1.00   Min.   :1.0   Min.   :1.00   Min.   :1.0  
##  1st Qu.:3.00   1st Qu.:3.00   1st Qu.:3.0   1st Qu.:1.00   1st Qu.:1.0  
##  Median :3.00   Median :3.00   Median :3.0   Median :1.00   Median :1.0  
##  Mean   :3.13   Mean   :3.16   Mean   :3.4   Mean   :1.38   Mean   :1.4  
##  3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.0   3rd Qu.:2.00   3rd Qu.:2.0  
##  Max.   :4.00   Max.   :4.00   Max.   :4.0   Max.   :4.00   Max.   :4.0  
##    Esteem87_3     Esteem87_4    Esteem87_5     Esteem87_6     Esteem87_7  
##  Min.   :1.00   Min.   :1.0   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.0   1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00  
##  Median :4.00   Median :1.0   Median :4.00   Median :2.00   Median :2.00  
##  Mean   :3.58   Mean   :1.5   Mean   :3.53   Mean   :1.59   Mean   :1.72  
##  3rd Qu.:4.00   3rd Qu.:2.0   3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :4.00   Max.   :4.0   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##    Esteem87_8    Esteem87_9    Esteem87_10  
##  Min.   :1.0   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.0   1st Qu.:3.00   1st Qu.:3.00  
##  Median :3.0   Median :3.00   Median :3.00  
##  Mean   :3.1   Mean   :3.06   Mean   :3.37  
##  3rd Qu.:4.0   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :4.0   Max.   :4.00   Max.   :4.00
## [1] 0

There are no missing values. However, there are negative values in variables where they do not make sense: for Income87 the minimum value is -2, and for HeightFeet05 the minimum value is -4.

1.2 Self-esteem evaluation

Let us concentrate on the Esteem scores evaluated in 1987.

  1. Reverse Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (Hint: if we store the esteem data in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to reverse the score.)

  2. Write a brief summary with necessary plots about the 10 esteem measurements.

Answer: A majority of the responses for all the esteem measurements fall on the higher end (3 or 4). However, the distributions differ: some measurements have a clear majority of 4s (questions 1, 2, 3, and 5), some have roughly equal counts of 3s and 4s (questions 4, 6, and 10), and some have a majority of 3s (questions 7, 8, and 9).
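A sketch of one way to produce such plots, assuming the reversed 1987 scores sit in a data frame named `esteem87` with columns Esteem87_1 … Esteem87_10 (the data frame name is our own):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# long format: one row per (question, response), then one bar chart per question
esteem87 %>%
  pivot_longer(everything(), names_to = "question", values_to = "score") %>%
  ggplot(aes(x = factor(score))) +
  geom_bar() +
  facet_wrap(~ question, ncol = 5) +
  labs(x = "response (after reversal, 4 = highest self-esteem)", y = "count")
```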

  3. Are the esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.

##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1       1.000      0.704      0.448      0.528      0.399      0.464
## Esteem87_2       0.704      1.000      0.443      0.551      0.402      0.481
## Esteem87_3       0.448      0.443      1.000      0.408      0.549      0.410
## Esteem87_4       0.528      0.551      0.408      1.000      0.381      0.509
## Esteem87_5       0.399      0.402      0.549      0.381      1.000      0.405
## Esteem87_6       0.464      0.481      0.410      0.509      0.405      1.000
## Esteem87_7       0.379      0.410      0.343      0.422      0.370      0.600
## Esteem87_8       0.273      0.283      0.351      0.295      0.381      0.409
## Esteem87_9       0.236      0.259      0.349      0.287      0.354      0.364
## Esteem87_10      0.312      0.330      0.460      0.366      0.436      0.442
##             Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1       0.379      0.273      0.236       0.312
## Esteem87_2       0.410      0.283      0.259       0.330
## Esteem87_3       0.343      0.351      0.349       0.460
## Esteem87_4       0.422      0.295      0.287       0.366
## Esteem87_5       0.370      0.381      0.354       0.436
## Esteem87_6       0.600      0.409      0.364       0.442
## Esteem87_7       1.000      0.389      0.352       0.390
## Esteem87_8       0.389      1.000      0.430       0.438
## Esteem87_9       0.352      0.430      1.000       0.579
## Esteem87_10      0.390      0.438      0.579       1.000

Answer: The esteem scores are all positively correlated with each other. Esteem 1 and 2 have the strongest pairwise correlation, with r = 0.704. Esteem score 8 seems to have the weakest correlations with the other items, with none of its r values above 0.5; scores 7 and 9 are also on the weaker side. Esteem scores with similar histograms in part 2 tend to correlate more strongly with each other, and more weakly with scores whose histograms differ in shape.
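The pairwise table above can be produced with a call of this form (a sketch, again assuming the ten 1987 columns sit in `esteem87`):

```r
# pairwise Pearson correlations, rounded to 3 digits to match the printed table
round(cor(esteem87), 3)
```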

  4. Apply PCA on the 10 esteem measurements (centered but not scaled).
## Standard deviations (1, .., p=10):
##  [1] 2.166 1.102 0.911 0.809 0.770 0.704 0.671 0.635 0.613 0.542
## 
## Rotation (n x k) = (10 x 10):
##               PC1     PC2      PC3    PC4     PC5      PC6     PC7      PC8
## Esteem87_1  0.324 -0.4452  0.10706 -0.216  0.2366 -0.32251  0.0125 -0.08076
## Esteem87_2  0.333 -0.4283  0.05545 -0.234  0.1994 -0.27710  0.0628 -0.03945
## Esteem87_3  0.322  0.0115  0.51291  0.245 -0.1717  0.00722 -0.6207  0.39498
## Esteem87_4  0.324 -0.2877 -0.10925 -0.211 -0.0925  0.82730  0.1261  0.17362
## Esteem87_5  0.315  0.0793  0.45088  0.471 -0.1248  0.01895  0.6449 -0.18398
## Esteem87_6  0.347 -0.0492 -0.41856  0.181 -0.2012  0.01423 -0.1951 -0.30635
## Esteem87_7  0.315  0.0196 -0.54957  0.335 -0.2752 -0.27829  0.0638  0.27919
## Esteem87_8  0.280  0.3619 -0.13974  0.236  0.8238  0.15725 -0.0958  0.00386
## Esteem87_9  0.277  0.4917 -0.00179 -0.516 -0.0809 -0.17689  0.2776  0.46417
## Esteem87_10 0.318  0.3918  0.10200 -0.323 -0.2217  0.02647 -0.2246 -0.62020
##                 PC9     PC10
## Esteem87_1  -0.0113  0.68676
## Esteem87_2   0.0804 -0.72056
## Esteem87_3  -0.0327 -0.02785
## Esteem87_4   0.1198  0.05182
## Esteem87_5  -0.0594 -0.00206
## Esteem87_6  -0.7034 -0.04151
## Esteem87_7   0.4975  0.05915
## Esteem87_8   0.0499 -0.00864
## Esteem87_9  -0.2905  0.01545
## Esteem87_10  0.3811  0.01244
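Output like the above can come from `prcomp` with centering but no scaling (a sketch; `esteem87` assumed as before):

```r
# centered but unscaled PCA of the ten 1987 esteem scores
pca.esteem <- prcomp(esteem87, center = TRUE, scale. = FALSE)
pca.esteem   # prints the standard deviations and the rotation (loading) matrix
```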
a) Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal?


```
##               PC1     PC2
## Esteem87_1  0.324 -0.4452
## Esteem87_2  0.333 -0.4283
## Esteem87_3  0.322  0.0115
## Esteem87_4  0.324 -0.2877
## Esteem87_5  0.315  0.0793
## Esteem87_6  0.347 -0.0492
## Esteem87_7  0.315  0.0196
## Esteem87_8  0.280  0.3619
## Esteem87_9  0.277  0.4917
## Esteem87_10 0.318  0.3918
```

By construction, the PCA loading vectors each have unit length, so the PC1 and PC2 loadings are unit vectors, and they are orthogonal (perpendicular) to each other.
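Both properties can be checked numerically (a sketch, assuming `pca.esteem` is the `prcomp` object):

```r
phi <- pca.esteem$rotation[, 1:2]    # PC1 and PC2 loadings
colSums(phi^2)                       # unit length: both should be 1 (up to rounding)
sum(phi[, 1] * phi[, 2])             # orthogonality: inner product should be ~0
```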

b) Are there good interpretations for PC1 and PC2? (If the loadings are all negative, take the positive loadings for ease of interpretation.)

PC1 has roughly equal positive loadings on all ten items, so it is essentially a weighted sum of the scores and can be interpreted as an overall level of self-esteem. The slightly smaller loadings for esteem scores 8 and 9 align with the answer in question 3 that they correlate more weakly with the other items, while scores 2 and 6 carry slightly more weight. PC2 is a contrast between two groups of items: it increases with esteem scores 3, 5, 7, 8, 9, and 10 and decreases with scores 1, 2, 4, and 6, so it separates subjects who answer the two groups of questions differently.

c) How is the PC1 score obtained for each subject? Write down the formula.

The PC1 score for each subject is obtained by centering each esteem score at its sample mean and then taking the weighted sum with the PC1 loadings. In this case: PC1 = 0.324(Esteem87_1 − mean(Esteem87_1)) + 0.333(Esteem87_2 − mean(Esteem87_2)) + 0.322(Esteem87_3 − mean(Esteem87_3)) + 0.324(Esteem87_4 − mean(Esteem87_4)) + 0.315(Esteem87_5 − mean(Esteem87_5)) + 0.347(Esteem87_6 − mean(Esteem87_6)) + 0.315(Esteem87_7 − mean(Esteem87_7)) + 0.280(Esteem87_8 − mean(Esteem87_8)) + 0.277(Esteem87_9 − mean(Esteem87_9)) + 0.318(Esteem87_10 − mean(Esteem87_10))
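The formula can be verified against the scores `prcomp` stores (a sketch; `esteem87` and `pca.esteem` assumed as before):

```r
# center each column at its mean, then weight by the PC1 loadings
centered   <- scale(esteem87, center = TRUE, scale = FALSE)
pc1.manual <- centered %*% pca.esteem$rotation[, 1]

# should agree with the PC1 scores computed by prcomp
all.equal(as.numeric(pc1.manual), as.numeric(pca.esteem$x[, 1]))
```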

d) Are PC1 scores and PC2 scores in the data uncorrelated?

Yes. By construction, all pairs of PC scores are uncorrelated: the sample correlation between the PC1 and PC2 scores is 0.
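A quick numerical check (a sketch, `pca.esteem` assumed as before):

```r
cor(pca.esteem$x[, 1], pca.esteem$x[, 2])   # zero up to floating-point error
```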

e) Plot PVE (Proportion of Variance Explained) and summarize the plot.


```
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.166 1.102 0.911 0.8090 0.7699 0.7037 0.6714 0.6346
## Proportion of Variance 0.469 0.121 0.083 0.0654 0.0593 0.0495 0.0451 0.0403
## Cumulative Proportion  0.469 0.590 0.673 0.7388 0.7981 0.8477 0.8927 0.9330
##                           PC9   PC10
## Standard deviation     0.6133 0.5421
## Proportion of Variance 0.0376 0.0294
## Cumulative Proportion  0.9706 1.0000
```

<img src="hw2_sp2022_files/figure-html/unnamed-chunk-7-1.png" width="768" /><img src="hw2_sp2022_files/figure-html/unnamed-chunk-7-2.png" width="768" />

The scree plot drops steeply after the first PC: PC1 alone explains about 47% of the variance (PVE = 0.469), PC2 explains about 12%, and the PVE levels off after the second PC. This elbow at 2 PCs indicates that the two leading PCs should be enough for most purposes, since a large portion of the variance is explained by them.

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?

<img src="hw2_sp2022_files/figure-html/unnamed-chunk-8-1.png" width="768" />

About 59% (0.590) of the variance in the data is explained by the first two principal components.
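Both quantities can be computed directly from the `prcomp` object (a sketch, `pca.esteem` assumed as before):

```r
pve  <- pca.esteem$sdev^2 / sum(pca.esteem$sdev^2)   # proportion of variance explained
cpve <- cumsum(pve)                                  # cumulative PVE

plot(pve,  type = "b", xlab = "PC", ylab = "PVE")
plot(cpve, type = "b", xlab = "PC", ylab = "Cumulative PVE", ylim = c(0, 1))
cpve[2]   # proportion explained by the first two PCs
```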

g) PCs provide us with a low-dimensional view of the self-esteem scores. Use a biplot with the first two PCs to display the data. Give an interpretation of PC1 and PC2 from the plot. (Try `ggbiplot` if you can, much prettier!)

<img src="hw2_sp2022_files/figure-html/unnamed-chunk-9-1.png" width="768" />

Loadings for PC1 are similar in magnitude and share the same sign. PC2 captures the difference between the total of esteem scores 3, 5, 7, 8, 9, 10 and the total of scores 1, 2, 4, 6. Esteem 1 and 2 are highly correlated with each other; 3 and 7 are also highly correlated, as are 8 and 10.

  5. Apply k-means to cluster subjects on the original esteem scores

    1. Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.

    There is a bend at k = 2, so a reasonable number of clusters to use is 2.
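The elbow plot can be drawn with `factoextra::fviz_nbclust` or by hand (a sketch, `esteem87` assumed as before):

```r
set.seed(1)   # kmeans uses random starts
# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(esteem87, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")
# equivalently: factoextra::fviz_nbclust(esteem87, kmeans, method = "wss")
```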

    2. Can you summarize common features within each cluster?
    ## List of 9
    ##  $ cluster     : int [1:2431] 1 2 2 1 2 2 2 1 2 2 ...
    ##  $ centers     : num [1:2, 1:10] 1.63 1.11 1.67 1.11 3.28 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:2] "1" "2"
    ##   .. ..$ : chr [1:10] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4" ...
    ##  $ totss       : num 8775
    ##  $ withinss    : num [1:2] 3353 2446
    ##  $ tot.withinss: num 5799
    ##  $ betweenss   : num 2976
    ##  $ size        : int [1:2] 1264 1167
    ##  $ iter        : int 1
    ##  $ ifault      : int 0
    ##  - attr(*, "class")= chr "kmeans"
    ##   Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6 Esteem87_7
    ## 1       1.63       1.67       3.28       1.80       3.18       1.94       2.04
    ## 2       1.11       1.11       3.90       1.18       3.90       1.22       1.37
    ##   Esteem87_8 Esteem87_9 Esteem87_10
    ## 1       2.70       2.67        2.97
    ## 2       3.53       3.49        3.80

    From the printed centers of the two clusters, cluster 2 gives the more extreme answers (scores close to 1 on questions 1, 2, 4, 6, and 7 and close to 4 on the rest), while cluster 1 gives milder answers, closer to the middle of the scale (2s and 3s).

    3. Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.

    For education and AFQT, there is not much of a clear boundary between the clusters; however, a majority of the people with fewer than 10 years of education do seem to fall in the second cluster. When looking at income and family income, there is a larger clump of cluster 2 with lower income and family income. There do not seem to be any clear boundaries between the clusters when looking at height and weight. Along PC1, however, there is a clear separation: points with negative PC1 scores are largely in cluster 2, while points with positive PC1 scores are largely in cluster 1.

  6. We now try to find out which factors are related to self-esteem. PC1 of all the esteem scores is a good variable to summarize one’s esteem scores, so we take PC1 as our response variable.

    a) Prepare possible factors/variables:
    • Personal information: gender, education (from 2005, which is problematic since it was measured after the 1987 esteem scores), log(income) in 1987, job type in 1987, and body mass index as a measure of health (BMI is defined as body mass divided by the square of body height, universally expressed in kg/m²). Since BMI is measured in 2005, it is not good practice to include it as a possible variable.

    • Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine, Inewspaper and Ilibrary as factors.

    • Use PC1 of the ASVAB scores as the level of intelligence

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      -2    4500   12000   13399   19000   59387
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   11167   20000   21252   27500   75001
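The variable preparation described in a) can be sketched as follows (the data frame name `nlsy`, the derived column names, and the standard lbs/inches BMI conversion factor 703 are our own assumptions):

```r
library(dplyr)

esteem_prepared <- nlsy %>%
  mutate(
    HeightInchTotal   = 12 * HeightFeet05 + HeightInch05,
    BMI               = 703 * Weight05 / HeightInchTotal^2,  # kg/m^2 from pounds and inches
    # note: the negative Income87 values found in the EDA would need cleaning first
    LogIncome87       = log(Income87 + 1),                   # +1 keeps zero incomes finite
    LogFamilyIncome78 = log(FamilyIncome78 + 1),
    Imagazine         = as.factor(Imagazine),
    Inewspaper        = as.factor(Inewspaper),
    Ilibrary          = as.factor(Ilibrary)
  )
```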
    b) Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final best model with your own criterion.

  - How did you land on this model? Run a model diagnosis to see if the linear model assumptions are reasonably met. 
    
  - Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem. 
    

## `geom_smooth()` using formula 'y ~ x'


## 
## Call:
## lm(formula = PC1 ~ Education05 + LogIncome87 + BMI + Imagazine + 
##     Inewspaper + Ilibrary + MotherEd + FatherEd + LogFamilyIncome78, 
##     data = esteem_prepared)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.443 -0.363  0.097  0.300  2.211 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.28047    0.15294   47.60   <2e-16 ***
## Education05       -0.00200    0.00567   -0.35    0.725    
## LogIncome87        0.00109    0.00391    0.28    0.781    
## BMI               -0.00160    0.00189   -0.85    0.397    
## Imagazine1         0.03465    0.03036    1.14    0.254    
## Inewspaper1        0.06688    0.03932    1.70    0.089 .  
## Ilibrary1         -0.04450    0.03124   -1.42    0.154    
## MotherEd           0.00364    0.00632    0.58    0.565    
## FatherEd           0.00121    0.00467    0.26    0.795    
## LogFamilyIncome78  0.02311    0.01394    1.66    0.097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.611 on 2421 degrees of freedom
## Multiple R-squared:  0.00643,    Adjusted R-squared:  0.00274 
## F-statistic: 1.74 on 9 and 2421 DF,  p-value: 0.0749
## 
## Call:
## lm(formula = PC1 ~ Education05 + LogIncome87 + BMI + Imagazine + 
##     Inewspaper + Ilibrary + MotherEd + FatherEd + LogFamilyIncome78, 
##     data = esteem_prepared_remove0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.463 -0.354  0.076  0.306  2.196 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.76e+00   2.17e-01   31.17  < 2e-16 ***
## Education05       -1.87e-03   6.05e-03   -0.31     0.76    
## LogIncome87        5.89e-02   1.36e-02    4.32  1.6e-05 ***
## BMI               -1.62e-03   1.97e-03   -0.82     0.41    
## Imagazine1         4.50e-02   3.28e-02    1.37     0.17    
## Inewspaper1        3.55e-02   4.27e-02    0.83     0.41    
## Ilibrary1         -3.40e-02   3.32e-02   -1.03     0.31    
## MotherEd          -3.94e-05   6.71e-03   -0.01     1.00    
## FatherEd           1.76e-03   4.94e-03    0.36     0.72    
## LogFamilyIncome78  2.53e-02   1.85e-02    1.37     0.17    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.608 on 2110 degrees of freedom
## Multiple R-squared:  0.0149, Adjusted R-squared:  0.0107 
## F-statistic: 3.55 on 9 and 2110 DF,  p-value: 0.000222
## 
## Call:
## lm(formula = PC1 ~ LogIncome87 + Imagazine + LogFamilyIncome78, 
##     data = esteem_prepared_remove0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.471 -0.354  0.077  0.307  2.195 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.6764     0.1989   33.56   <2e-16 ***
## LogIncome87         0.0598     0.0135    4.42    1e-05 ***
## Imagazine1          0.0492     0.0307    1.60     0.11    
## LogFamilyIncome78   0.0281     0.0177    1.59     0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.608 on 2116 degrees of freedom
## Multiple R-squared:  0.0138, Adjusted R-squared:  0.0124 
## F-statistic: 9.84 on 3 and 2116 DF,  p-value: 1.9e-06

Answer: Our final model is PC1 = 6.6764 + 0.0598 * LogIncome87 + 0.0492 * Imagazine + 0.0281 * LogFamilyIncome78. To arrive at this model, we first ran a regression of PC1 on all the candidate variables and kept the ones with more significant p-values. From the residuals-vs-fitted and normal Q-Q plots, the linear model assumptions are reasonably met. Overall, holding everything else constant, LogIncome87 has the strongest positive association with self-esteem, followed by Imagazine and then LogFamilyIncome78. In other words, someone in a household where someone regularly read magazines, and with a higher personal and family income, tends to have higher self-esteem.
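The diagnostics mentioned above come directly from the fitted `lm` object (a sketch; `fit.final` is a hypothetical name for the final model):

```r
par(mfrow = c(1, 2))
plot(fit.final, which = 1)   # residuals vs fitted: checks linearity / equal variance
plot(fit.final, which = 2)   # normal Q-Q: checks normality of the residuals
```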

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types. The genomic data is open to the public through the Genomic Data Commons Data Portal (GDC).

In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), and HER2-enriched (Her2). The sub-typing is based on PAM50, a clinical-grade luminal-basal classifier.

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like breast cancers, or triple-negative breast cancers, do not have the three receptors that the other sub-types have, so they have fewer treatment options.

We will try to use mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use k-means and spectral clustering to cluster the mRNA data and see whether the sub-types can be separated through mRNA data alone.

We first read the data using data.table::fread(), which is a faster way to read in big data than read.csv().
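A sketch of the read step (file names from the “Data needed” list; the object names are our own):

```r
library(data.table)

brca_subtype <- fread("brca_subtype.csv")
brca_x       <- fread("brca_x_patient.csv")
```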

  1. Summary and transformation

    a) How many patients are there in each sub-type?
## brca_subtype
## Basal  Her2  LumA  LumB 
##   208    91   628   233

This table shows the number of patients in each sub-type.

    b) Randomly pick 5 genes and plot the histogram by each sub-type.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    c) Remove genes with zero counts and no variability. Then apply a logarithmic transform.
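Step c) can be sketched as follows (assuming `brca_x` is a patients-by-genes table of raw counts; `matrixStats` supplies fast column statistics):

```r
library(matrixStats)

x <- as.matrix(brca_x)                      # patients-by-genes expression matrix
keep <- colSums(x) > 0 & colVars(x) > 0     # drop zero-count and constant genes
brca_log <- log2(x[, keep] + 1)             # log transform; +1 avoids log2(0)
```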
  2. Apply kmeans on the transformed dataset with 4 centers and output the discrepancy table between the real sub-types (brca_subtype) and the cluster labels.
##             
## brca_subtype   1   2   3   4
##        Basal  17   3   1 187
##        Her2    9  23  43  16
##        LumA   86 147 395   0
##        LumB   22 105 104   2
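A sketch of how such a discrepancy table can be produced, assuming the log-transformed expression matrix is stored in an object called expr_log (a name of our choosing):

```r
set.seed(1)  # kmeans uses random starting centers
km <- kmeans(expr_log, centers = 4)

# rows: true sub-type; columns: kmeans cluster label
table(brca_subtype, km$cluster)
```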
  3. Spectral clustering: to scale or not to scale?

    a) Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba(). We should use 4 PCs, since that is the elbow of the scree plot of the approximate proportion of variance explained (PVE): the PVE rises sharply up to PC4 and levels off after that.
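One way to compute only the leading PCs cheaply is irlba's prcomp_irlba(), which uses a truncated SVD. A hedged sketch, with expr_log as our placeholder name for the cleaned, log-transformed matrix:

```r
library(irlba)

# approximate PCA: compute only the first 50 PCs instead of all ~20,000
pca_scaled <- prcomp_irlba(expr_log, n = 50, center = TRUE, scale. = TRUE)

# approximate proportion of variance explained, for the elbow plot
pve <- pca_scaled$sdev^2 / pca_scaled$totalvar
plot(pve, type = "b", xlab = "PC", ylab = "Approximate PVE")
```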

    b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering purposes? Why? (Hint: to put plots side by side, use gridExtra::grid.arrange() or ggpubr::ggarrange() or egg::ggarrange() for ggplots; use fig.show="hold" as a chunk option for base plots.) From these plots, the unscaled data actually seems better for clustering: the points in the scaled plot are much more homogeneous and groups are hard to distinguish, while in the unscaled plot we can easily see two groups that would be easier to split into clusters.
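The side-by-side comparison might look like the following sketch, assuming pca_scaled and pca_unscaled (our names) hold the two PCA results:

```r
library(ggplot2)
library(gridExtra)

p1 <- ggplot(data.frame(pca_scaled$x), aes(PC1, PC2)) +
  geom_point(size = 0.5) + ggtitle("Centered and scaled")
p2 <- ggplot(data.frame(pca_unscaled$x), aes(PC1, PC2)) +
  geom_point(size = 0.5) + ggtitle("Centered only")

grid.arrange(p1, p2, ncol = 2)
```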

  4. Spectral clustering: center but do not scale the data

    a) Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule. By the elbow rule we could reasonably choose k = 3 or 4, but given that there are four sub-types in this case, we'll use 4 clusters.
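A sketch of the elbow computation (pca_unscaled is our placeholder for the centered-only PCA result):

```r
pc4 <- pca_unscaled$x[, 1:4]

# total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k)
  kmeans(pc4, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```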

    b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering labels as follows: plot a scatter plot of PC1 vs PC2, using point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids as black dots. Summarize how good the clustering results are compared to the real sub-types.

Our results are pretty good for some sub-types. Cluster 3 in particular is a great fit for basal: it is the group we observed earlier as most separated from the rest of the data. Although there are many lumA data points, Cluster 1 consists almost entirely of them, and most lumB points fall in Cluster 2. HER2-enriched cancers are the one category our clusters fail to capture at all: those points are so spread out that Cluster 4 ends up capturing a wide variety of sub-types instead.
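The plot described above could be built along these lines (pc4, km4, and brca_subtype are placeholder names for the objects in the analysis):

```r
set.seed(1)
km4 <- kmeans(pc4, centers = 4, nstart = 10)

dat <- data.frame(pc4, subtype = brca_subtype,
                  cluster = factor(km4$cluster))
ggplot(dat, aes(PC1, PC2, color = subtype, shape = cluster)) +
  geom_point(alpha = 0.6) +
  # kmeans centroids as black dots
  geom_point(data = data.frame(km4$centers), aes(PC1, PC2),
             inherit.aes = FALSE, size = 3)
```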

    c) Compare the clustering result from applying kmeans to the original data with the clustering result from applying kmeans to the first 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?

We can compare the discrepancy tables from kmeans on the original data (first table) and from kmeans on the first 4 PCs (second table).

##             
## brca_subtype   1   2   3   4
##        Basal  17   3   1 187
##        Her2    9  23  43  16
##        LumA   86 147 395   0
##        LumB   22 105 104   2
##             
## brca_subtype   1   2   3   4
##        Basal   1   3 187  17
##        Her2   27  27  28   9
##        LumA  376 161   0  91
##        LumB   99 110   2  22

From these tables, we see similar performance in grouping the basal sub-type, while some of the others perform somewhat worse with PCA. In particular, Her2 performs poorly under both clusterings, but is noticeably more spread out among the clusters with PCA. LumA still has a high concentration in one cluster, but noticeably fewer points in that cluster (and more in the others) with PCA. The only sub-type that does marginally better with PCA is lumB. In our case, PCA likely only sped up the computation of the clusters, since it does not appear to have improved the grouping of the sub-types.

    d) Now we have a patient x with breast cancer but with unknown sub-type, along with this patient's mRNA sequencing data. Project patient x onto the space of PC1 and PC2. (Hint: remember we removed some genes with no counts or no variability, took the log, and centered.) Plot this patient in the plot from iv) as a black dot. Calculate the Euclidean distance between this patient and the centroid of each cluster. Can you tell which sub-type this patient might have?

We will add this x patient to the graph with a dark red dot to distinguish it from the black dots which represent the cluster centers.
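The projection and distance calculation might be sketched as follows; patient_x, kept_genes, gene_means, pca_unscaled, and km4 are placeholder names for the patient data, gene filter, training gene means, PCA result, and kmeans fit from the earlier steps:

```r
# same preprocessing as the training data: same genes, log-transform,
# then center with the *training* gene means (not the patient's own)
x_new      <- log2(as.matrix(patient_x)[, kept_genes, drop = FALSE] + 1)
x_centered <- sweep(x_new, 2, gene_means)

# project onto the first 4 loading vectors of the unscaled PCA
x_proj <- x_centered %*% pca_unscaled$rotation[, 1:4]

# Euclidean distance from the patient to each cluster centroid
apply(km4$centers, 1, function(ctr) sqrt(sum((ctr - x_proj)^2)))
```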


##   1   2   3   4 
## 498 534 135 577

We can see that the Euclidean distance to cluster 1's centroid is 498, to cluster 2's is 534, to cluster 3's is 135, and to cluster 4's is 577. Based on these distances, patient x belongs to cluster 3, which, as discussed before, consists almost entirely of basal-like breast cancers. This suggests the patient may have the basal-like sub-type.

3 Case study 3: Auto data set

This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars. It is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR. The Auto dataset should then be loaded automatically. We'll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.

3.1 EDA

Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.

## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
##       mpg         cylinders     displacement   horsepower        weight    
##  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
##  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
##                                                                            
##   acceleration       year        origin                     name    
##  Min.   : 8.0   Min.   :70   Min.   :1.00   amc matador       :  5  
##  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   ford pinto        :  5  
##  Median :15.5   Median :76   Median :1.00   toyota corolla    :  5  
##  Mean   :15.5   Mean   :76   Mean   :1.58   amc gremlin       :  4  
##  3rd Qu.:17.0   3rd Qu.:79   3rd Qu.:2.00   amc hornet        :  4  
##  Max.   :24.8   Max.   :82   Max.   :3.00   chevrolet chevette:  4  
##                                             (Other)           :365
## [1] 0

Note that there are no NA values in the data set, meaning there are no missing values.
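The pairwise relationships discussed below can be examined with a quick sketch like:

```r
# pairwise scatter plots of the numeric variables (column 9 is name)
pairs(Auto[, 1:8], pch = 16, cex = 0.4)

# correlation matrix as a numeric companion to the plots
round(cor(Auto[, 1:8]), 2)
```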

## [1] -0.832

By looking at the scatter plot above and the correlation between the weight of the car and its MPG, we can say that there is a strong negative relationship between the two variables: a correlation of -0.832 is a strong indication that as the weight of the car increases, the MPG decreases.

We can see that the majority of the cars have an even number of cylinders. This is likely because even cylinder counts make it easier to balance the engine and keep the design symmetric.

## [1] -0.689

From the above graph and the correlation between horsepower and acceleration (-0.689), we can see that as a car's horsepower increases, its acceleration value (the time to reach 60 mph) decreases.

Although there are years where the median MPG of all the cars decreases, in general as the years go by there is an increase in MPG of the cars. This could be due to new developments in technology and inventions and innovations.
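The year-by-year trend mentioned above can be visualized with a boxplot sketch:

```r
library(ggplot2)
library(ISLR)

ggplot(Auto, aes(x = as.factor(year), y = mpg)) +
  geom_boxplot() +
  labs(x = "Model year", y = "MPG")
```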

3.2 What effect does time have on MPG?

  1. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
## 
## Call:
## lm(formula = mpg ~ year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.021  -5.441  -0.441   4.974  18.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -70.0117     6.6452   -10.5   <2e-16 ***
## year          1.2300     0.0874    14.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.335 
## F-statistic:  198 on 1 and 390 DF,  p-value: <2e-16

Answer: At the 0.05 level, we can see that year is a significant variable, as the p-value is <2e-16 and that is significantly less than 0.05. This means that we would reject the null hypothesis that \(\beta_{year} = 0\). Thus we get that year has some effect on MPG. According to this model, as the year increases by one, we see that MPG increases by 1.23 mpg.

  2. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year's effect found here.
## 
## Call:
## lm(formula = mpg ~ year + horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.077  -3.078  -0.431   2.588  15.315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.73917    5.34903   -2.38    0.018 *  
## year          0.65727    0.06626    9.92   <2e-16 ***
## horsepower   -0.13165    0.00634  -20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared:  0.685,  Adjusted R-squared:  0.684 
## F-statistic:  424 on 2 and 389 DF,  p-value: <2e-16

Answer: At the 0.05 level, we can see that year is a significant variable, as the p-value is <2e-16 and that is significantly less than 0.05. This means that we would reject the null hypothesis that \(\beta_{year} = 0\).

According to this model, holding horsepower constant, as the year increases by one, we see that MPG increases by 0.65727 mpg.

  3. The two 95% CI's for the coefficient of year differ between (i) and (ii). How would you explain the difference to a non-statistician?

For the confidence intervals we will use 1.96, since with far more than 30 observations the sampling distribution is approximately normal.

## [1] 1.06
## [1] 1.4

For the first confidence interval we have [1.06, 1.40].

## [1] 0.527
## [1] 0.787

The second confidence interval is [0.527, 0.787].
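Both intervals can also be obtained directly with confint(), which uses the exact t quantile rather than the 1.96 normal approximation:

```r
library(ISLR)

fit1 <- lm(mpg ~ year, data = Auto)
fit2 <- lm(mpg ~ year + horsepower, data = Auto)

confint(fit1, "year", level = 0.95)
confint(fit2, "year", level = 0.95)
```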

Answer: The two intervals differ because the two models adjust for different things. The first interval describes the overall change in mpg per model year, lumping together everything that changed along with the year, including horsepower. The second interval describes the change in mpg per year when comparing cars with the same horsepower: once horsepower is held fixed, part of the yearly improvement is already accounted for, so the estimated year effect, and therefore its interval, is smaller.

  4. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at the .05 level? Explain the year effect (if any).
## `geom_smooth()` using formula 'y ~ x'

Here we can see that the slope of mpg against horsepower differs from year to year; we will fit a model with an interaction term and confirm with a hypothesis test that this interaction matters.

## 
## Call:
## lm(formula = mpg ~ year * horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.349  -2.451  -0.456   2.406  14.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.27e+02   1.21e+01  -10.45   <2e-16 ***
## year             2.19e+00   1.61e-01   13.59   <2e-16 ***
## horsepower       1.05e+00   1.15e-01    9.06   <2e-16 ***
## year:horsepower -1.60e-02   1.56e-03  -10.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  393 on 3 and 388 DF,  p-value: <2e-16
## Analysis of Variance Table
## 
## Model 1: mpg ~ year + horsepower
## Model 2: mpg ~ year * horsepower
##   Res.Df  RSS Df Sum of Sq   F Pr(>F)    
## 1    389 7491                            
## 2    388 5903  1      1588 104 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: mpg
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## year              1   8028    8028     528 <2e-16 ***
## horsepower        1   8300    8300     546 <2e-16 ***
## year:horsepower   1   1588    1588     104 <2e-16 ***
## Residuals       388   5903      15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: Here we include an interaction between year and horsepower, which lets the effect of horsepower on mpg change with the year.

We reject the null hypothesis of no interaction with a p-value of <2e-16, indicating a significant interaction effect between horsepower and year.

3.3 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  1. Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?

First let’s only include the variables that make sense for the regression

## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"
## 
## Call:
## lm(formula = mpg ~ ., data = auto2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.590 -2.157 -0.117  1.869 13.060 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.72e+01   4.64e+00   -3.71  0.00024 ***
## cylinders    -4.93e-01   3.23e-01   -1.53  0.12780    
## displacement  1.99e-02   7.51e-03    2.65  0.00844 ** 
## horsepower   -1.70e-02   1.38e-02   -1.23  0.21963    
## weight       -6.47e-03   6.52e-04   -9.93  < 2e-16 ***
## acceleration  8.06e-02   9.88e-02    0.82  0.41548    
## year          7.51e-01   5.10e-02   14.73  < 2e-16 ***
## origin        1.43e+00   2.78e-01    5.13  4.7e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33 on 384 degrees of freedom
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  252 on 7 and 384 DF,  p-value: <2e-16

Answer: With a p-value of 0.1278, we fail to reject the null hypothesis that \(\beta_{cylinders} = 0\) at the 0.01 level. Thus, when cylinders is treated as a numeric variable, it cannot be considered significant.

  2. Fit a model that treats cylinders as a categorical/factor variable. Is cylinders significant at the 0.01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
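The ANOVA table below could be produced along these lines; a sketch, assuming `auto2` holds the Auto data without the `name` column:

```r
# cylinders as a factor; anova() reports sequential F-tests
fit_factor <- lm(mpg ~ as.factor(cylinders) + displacement + horsepower +
                   weight + acceleration + year + origin, data = auto2)
anova(fit_factor)
```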
## Analysis of Variance Table
## 
## Response: mpg
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## cylinders      4  15275    3819   398.1 < 2e-16 ***
## displacement   1   1098    1098   114.5 < 2e-16 ***
## horsepower     1    588     588    61.3 4.9e-14 ***
## weight         1    715     715    74.5 < 2e-16 ***
## acceleration   1      8       8     0.8    0.37    
## year           1   2245    2245   234.1 < 2e-16 ***
## origin         1    235     235    24.5 1.1e-06 ***
## Residuals    381   3655      10                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: Treated as a categorical/factor variable, cylinders is highly significant. With a p-value of < 2e-16, we reject the null hypothesis that cylinders has no effect on mpg; at least one of the factor-level coefficients is nonzero.

  3. What are the fundamental differences between treating cylinders as a continuous and categorical variable in your models?

Answer: Although cylinders is numeric in nature, its relationship with mpg is highly non-linear, as we saw in its histogram in the EDA. It is therefore more useful to treat the variable as a factor rather than as a numeric value. Coding cylinders as a factor breaks the single variable into a set of level indicators, so each number of cylinders can have its own effect on mpg rather than being forced onto a single linear slope. Indeed, the regression summary shows that each cylinder level has a different effect on mpg, with its own coefficient and p-value, whereas the numeric coding was not significant.

  4. Can you test the null hypothesis: fit0: mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable at .01 level?
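The nested-model comparison below could be run as follows; a sketch, with `fit0`/`fit1` as assumed object names and `auto2` from earlier:

```r
# fit0: cylinders numeric; fit1: cylinders as a factor
fit0 <- lm(mpg ~ cylinders + displacement + horsepower + weight +
             acceleration + year + origin, data = auto2)
fit1 <- lm(mpg ~ as.factor(cylinders) + displacement + horsepower +
             weight + acceleration + year + origin, data = auto2)
anova(fit0, fit1)  # F-test on the 3 extra df from the factor coding
```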
## Analysis of Variance Table
## 
## Model 1: mpg ~ cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin
## Model 2: mpg ~ cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    384 4252                              
## 2    381 3655  3       597 20.8 1.8e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under the null hypothesis that mpg is linear in cylinders (i.e., the two fits are equivalent), the nested F-test gives a p-value of 1.8e-12, which is less than 0.01. We therefore reject the null at the 0.01 level: the categorical coding of cylinders fits significantly better.

3.4 Results

Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.

  1. Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.

We will once again restrict attention to the variables that make sense in a model. As we found above, cylinders works better as a factor, so we keep that coding; below we also convert origin into a factor variable.

We will use backward elimination, removing variables until all remaining ones are significant. We will use a significance level of 0.05 for the single-term F-tests.

First, we start with the model including all possible predictors.
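The single-term-deletion table below could be generated like this; a sketch, with `fit_full` as an assumed object name:

```r
# full model with cylinders and origin as factors; drop1() runs the
# single-term-deletion F-tests shown below
auto2$cylinders <- as.factor(auto2$cylinders)
auto2$origin    <- as.factor(auto2$origin)
fit_full <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                 acceleration + year + origin, data = auto2)
drop1(fit_full, test = "F")
```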

## Single term deletions
## 
## Model:
## mpg ~ cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin
##              Df Sum of Sq  RSS  AIC F value  Pr(>F)    
## <none>                    3646  898                    
## cylinders     4       566 4213  947   14.76 3.2e-11 ***
## displacement  1        64 3711  903    6.71  0.0100 ** 
## horsepower    1        67 3713  903    6.96  0.0087 ** 
## weight        1       804 4450  974   83.79 < 2e-16 ***
## acceleration  1         1 3647  896    0.08  0.7802    
## year          1      2177 5824 1080  226.91 < 2e-16 ***
## origin        2       244 3890  920   12.71 4.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In fit0, acceleration has the largest p-value, 0.7802, which is greater than 0.05, so we drop that variable from the model.

## Single term deletions
## 
## Model:
## mpg ~ cylinders + displacement + horsepower + weight + year + 
##     origin
##              Df Sum of Sq  RSS  AIC F value  Pr(>F)    
## <none>                    3647  896                    
## cylinders     4       574 4221  946   15.00 2.1e-11 ***
## displacement  1        64 3711  901    6.65   1e-02 *  
## horsepower    1       115 3762  906   11.96   6e-04 ***
## weight        1      1014 4661  990  105.90 < 2e-16 ***
## year          1      2187 5834 1078  228.42 < 2e-16 ***
## origin        2       245 3892  918   12.78 4.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After dropping acceleration and rerunning the single-term F-tests, all of the remaining p-values are below 0.05. We then check the diagnostic plots for the new model to make sure the model assumptions are met.
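The diagnostic plots can be drawn with base R's `plot` method for `lm` objects; a sketch, assuming the reduced model is stored as `fit1`:

```r
# standard diagnostics for the reduced model
par(mfrow = c(1, 2))
plot(fit1, which = 1)  # residuals vs. fitted values
plot(fit1, which = 2)  # normal Q-Q plot of residuals
```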

Both diagnostic plots look reasonable: the residuals show no obvious pattern against the fitted values and are approximately normal. Thus we will use fit1 as our final model.

  2. Summarize the effects found.

Our final model uses cylinders, displacement, horsepower, weight, year, and origin to predict a car's mpg. We code both cylinders and origin as factors, since their effects are non-linear and they are not significant when kept as numeric values. Lastly, using backward elimination, we found that acceleration is not significant and can be removed.

  3. Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.
##    fit lwr  upr
## 1 18.6  12 25.2

The prediction interval for the given car is [12 mpg, 25.2 mpg]: we can be 95% confident that a new observation of a car with these specifications will fall within this range.

##    fit  lwr  upr
## 1 18.6 16.1 21.2

The confidence interval for a car with these specifications is [16.1 mpg, 21.2 mpg]. If repeated samples were drawn from the same population and a 95% CI computed for each, we would expect about 95% of those intervals to contain the true mean mpg for such cars. Note also that the confidence interval is narrower than the prediction interval, since the prediction interval carries the added uncertainty of predicting an individual value rather than the mean.
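The two intervals reported above can be obtained with `predict()`; a sketch, assuming the final model `fit1` codes cylinders and origin as factors, uses two-digit years (1983 as 83), and origin level "1" for the US (color and length are not in the model, so they are ignored):

```r
# hypothetical new-car data frame matching the model's predictors
newcar <- data.frame(cylinders = "8", displacement = 350, horsepower = 260,
                     weight = 4000, year = 83, origin = "1")
predict(fit1, newcar, interval = "prediction", level = 0.95)  # PI
predict(fit1, newcar, interval = "confidence", level = 0.95)  # CI
```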